Enhancing program execution using optimization-driven inlining

ABSTRACT

Optimizing program execution includes performing, to obtain an expanded call graph, an expansion of an initial call graph. The expanded call graph includes nodes. The initial call graph is defined for a program that includes a root method and a child method. The method may further include calculating a cost value and a benefit value for inlining the child method, calculating an inlining priority value as a function of the cost value and the benefit value, and inlining, based on analyzing the expanded call graph and comparing the inlining priority value to a dynamic threshold, the child method into the root method. The child method may correspond to a node in the expanded call graph.

CROSS-REFERENCE TO RELATED APPLICATION

The present application is a continuation application of and, thereby, claims benefit under 35 U.S.C. § 120 to U.S. application Ser. No. 15/917,482, entitled, “ENHANCING PROGRAM EXECUTION USING OPTIMIZATION-DRIVEN INLINING,” filed on Mar. 9, 2018, having the same inventors, and incorporated herein by reference.

BACKGROUND

When a computer program is written, the computer program is written as source code. A compiler is a software program that translates the source code into object code, bytecode, or assembly code. Object code, bytecode, or assembly code can be executed directly by a computer processor or a virtual machine. During compilation, the compiler may perform various optimizations. For example, optimizations may reduce the number of instructions executed by a computer processor. By performing the optimizations, the compiler is able to provide more efficient use of the computer processor.

One way to benefit from the information spread across a call graph data structure and to apply additional optimizations to the computer program is to replace the function calls with the respective function bodies, a transformation called inline expansion or inlining. Most compilers rely heavily on inlining, since inlining a function body is fast, enables other optimizations, and does not require a whole-program analysis.

Although replacing a call-site (e.g., the location, or line of code, where the function is called) with the body of the callee function is a simple transformation, deciding which functions to inline is difficult in practice. Consequently, in many compilers, inlining is based on hand-tuned heuristics and proverbial rules of thumb.

SUMMARY

In general, in one aspect, one or more embodiments relate to a method, system, and computer readable medium for optimizing program execution of a program. The method includes performing, to obtain an expanded call graph, an expansion of an initial call graph. The expanded call graph includes nodes. The initial call graph is defined for a program including a root method and a child method. The method further includes calculating a cost value and a benefit value for inlining the child method, calculating an inlining priority value as a function of the cost value and the benefit value, and inlining, based on analyzing the expanded call graph and comparing the inlining priority value to a dynamic threshold, the child method into the root method. The child method corresponds to a node in the expanded call graph.

The system includes memory and a computer processor configured to execute a compiler stored in the memory. The compiler causes the computer processor to perform, to obtain an expanded call graph, an expansion of an initial call graph. The expanded call graph includes nodes. The initial call graph is defined for a program including a root method and a child method. The compiler further causes the computer processor to calculate a cost value and a benefit value for inlining the child method, calculate an inlining priority value as a function of the cost value and the benefit value, and inline, based on analyzing the expanded call graph and comparing the inlining priority value to a dynamic threshold, the child method into the root method. The child method corresponds to a node in the expanded call graph.

The non-transitory computer readable medium includes instructions that, when executed by a computer processor, perform operations comprising performing, to obtain an expanded call graph, an expansion of an initial call graph. The expanded call graph includes multiple nodes. The initial call graph is defined for a program that includes a root method and a child method. The operations further comprise calculating a cost value and a benefit value for inlining the child method, calculating an inlining priority value as a function of the cost value and the benefit value, and inlining, based on analyzing the expanded call graph and comparing the inlining priority value to a dynamic threshold, the child method into the root method. The child method corresponds to a node in the expanded call graph.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 shows a diagram of a system in accordance with one or more embodiments of the invention.

FIG. 2 shows a state diagram of call graph nodes in accordance with one or more embodiments of the invention.

FIGS. 3, 4, 5, and 6 show flowcharts in accordance with one or more embodiments of the invention.

FIGS. 7A, 7B, 7C, 7D, 7E, 7F, 7G, and 7H show examples in accordance with one or more embodiments of the invention.

FIGS. 8A and 8B show a computing system in accordance with one or more embodiments of the invention.

DETAILED DESCRIPTION

Specific embodiments of the invention will now be described in detail with reference to the accompanying figures. Like elements in the various figures are denoted by like reference numerals for consistency.

In the following detailed description of embodiments of the invention, numerous specific details are set forth in order to provide a more thorough understanding of the invention. However, it will be apparent to one of ordinary skill in the art that the invention may be practiced without these specific details. In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description.

Throughout the application, ordinal numbers (e.g., first, second, third, etc.) may be used as an adjective for an element (i.e., any noun in the application). The use of ordinal numbers is not to imply or create any particular ordering of the elements nor to limit any element to being only a single element unless expressly disclosed, such as by the use of the terms “before”, “after”, “single”, and other such terminology. Rather, the use of ordinal numbers is to distinguish between the elements. By way of an example, a first element is distinct from a second element, and the first element may encompass more than one element and succeed (or precede) the second element in an ordering of elements.

Embodiments of the invention relate to an inlining procedure based on several concepts. First, the call graph exploration is incremental: the procedure partially explores the call graph during the expansion stage, then switches to the inlining stage, and these two stages alternate until a termination condition is met. Second, the call graph exploration is prioritized using the ratio of the inlining benefit to the inlining cost of the candidate call-sites. Third, the inlining benefit is estimated by performing optimizations speculatively throughout the call graph after replacing the function parameters with the concrete call-site arguments, and by relying on the profile information obtained during a prior execution of the program.

In one or more embodiments of the invention, cost-benefit analysis uses a heuristic to identify call graph subcomponents that should be inlined together. The cost-benefit analysis is performed by analyzing whether inlining a call-site increases the benefit-per-cost ratio of the caller. In one or more embodiments of the invention, inlining is budget-driven: the minimum benefit-per-cost ratio required for inlining grows dynamically with the amount of work already performed by the procedure.

FIG. 1 shows a system in accordance with one or more embodiments of the invention. As shown in FIG. 1, the system is a computing system (101), such as the computing system shown in FIGS. 8A and 8B and described below. The computing system (101) includes a target program (110) that is provided to the compiler (111), which invokes a profiler (112) to assist with creating a call graph (108). The compiler (111) executes on the computing system (101) to transform the provided target program (110) into bytecode, object code, or some other program representation. The profiler (112) dynamically analyzes the source code and identifies critical sections of the code.

The computing system (101) also includes a data repository (102), which stores the data used by or generated by the components of the computing system. For example, the data repository (102) may be a relational database, a hierarchical database, or any other form of repository of data. In one or more embodiments, the repository (102) is essentially the same as the repository shown and described in relation to the computing system in FIG. 8A.

Continuing with FIG. 1, the data repository (102) may include logs (103), a termination condition (104), a dynamic threshold (105), an expansion threshold (106), and profiling information (107). Within the profiling information (107) is the call graph (108), which is a control flow graph representing the calling relationships between subroutines (or methods) in the target program (110). Using the expansion threshold (106) and the dynamic threshold (105), one or more embodiments evaluate whether or not to inline methods, depending on whether a certain termination condition (104) is met. The results of this activity are recorded in the logs (103).

In one or more embodiments, the compiler (111) analyzes the methods of the target program (110). In one or more embodiments of the invention, the compiler (111) starts with a call graph consisting only of the root node (i.e., the compilation unit) and creates an expanded call graph. The expanded call graph is obtained by adding call graph nodes for call-sites inside some of the nodes that are not yet associated with their own (i.e., the call-sites' own) call graph nodes. In one or more embodiments, the compiler (111) then inlines, based on an analysis of the expanded call graph, one or more methods found within the target program (110) into a root method. The compiler (111) then performs an optimization operation in response to inlining the method, and updates the expanded call graph based on the optimization operation to obtain an updated call graph. This process may be repeated multiple times: the compiler (111) obtains an expanded call graph by performing an expansion of the updated call graph, and then inlines, based on an analysis of the expanded call graph, the method into the root method. When certain termination conditions are met, the compiler (111) completes compilation of the target program (110). Details of these steps are shown and discussed in relation to FIG. 3.

In one or more embodiments, FIG. 2 shows the types and states of call graph nodes (200), the elements of the call graph used in this invention, including the Cutoff (C) Node (201), Inline Cache (I) Node (202), Deleted (D) Node (203), Generic (G) Node (204), and Explored (E) Node (205). A Cutoff Node (201) represents a call to a function whose body has not been explored. An Inline Cache Node (202) represents a call that can dispatch to multiple known target functions. A Deleted Node (203) represents a call-site that was originally in the intermediate representation but was removed by an optimization. A Generic Node (204) represents a call to a function that will not be considered for inlining. An Explored Node (205) represents a call to a function whose body was explored. In one or more embodiments, the compiler uses these node types when evaluating the methods of the target program as part of its optimization.
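For illustration only, the five node kinds of FIG. 2 can be modeled as a small enumeration. The names `NodeKind`, `can_expand`, and `has_children` are hypothetical. The sketch assumes, as the descend-and-expand description suggests, that only explored and inline cache nodes have children to descend into and that only cutoff nodes are expanded.

```python
from enum import Enum


class NodeKind(Enum):
    """Hypothetical model of the call graph node kinds of FIG. 2."""
    CUTOFF = "C"        # call to a function whose body has not been explored
    INLINE_CACHE = "I"  # call that can dispatch to several known targets
    DELETED = "D"       # call-site removed from the IR by an optimization
    GENERIC = "G"       # call that will not be considered for inlining
    EXPLORED = "E"      # call to a function whose body was explored


def can_expand(kind: NodeKind) -> bool:
    # Only cutoff nodes are expanded during the expansion stage.
    return kind is NodeKind.CUTOFF


def has_children(kind: NodeKind) -> bool:
    # Explored and inline cache nodes have child call-sites to descend into.
    return kind in (NodeKind.EXPLORED, NodeKind.INLINE_CACHE)
```

Because the enumeration values are the one-letter labels used in FIGS. 7B to 7H, `NodeKind("G")` maps a label from the figures back to its kind.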

FIGS. 3, 4, 5, and 6 show flowcharts in accordance with one or more embodiments of the invention. While the various steps in these flowcharts are presented and described sequentially, one of ordinary skill will appreciate that some or all of the steps may be executed in different orders, may be combined or omitted, and some or all of the steps may be executed in parallel. Furthermore, the steps may be performed actively or passively. For example, some steps may be performed using polling or be interrupt driven in accordance with one or more embodiments of the invention. By way of an example, determination steps may not require a processor to process an instruction unless an interrupt is received to signify that the condition exists in accordance with one or more embodiments of the invention. As another example, determination steps may be performed by performing a test, such as checking a data value to test whether the value is consistent with the tested condition in accordance with one or more embodiments of the invention.

FIG. 3 shows a method for the overall process of one or more embodiments of the invention. In Step 301, the call graph of the program is expanded. The expand function repeatedly calls the ‘descend and expand’ subroutine until the policy's subroutine that checks whether the expansion is completed returns “true”.

The expand policy subroutine ensures that the queue data structure of each node initially contains the children of that node, sorted by the priority P. The priority can be computed as, but is not limited to, the value B/C, where B is the benefit of inlining that (and only that) specific node, and C is the code size increase resulting from inlining the node. The ‘descend and expand’ subroutine descends along one path in the call graph, at each step choosing the node with the highest priority, until it reaches a cutoff node, and then expands that node. If the ‘descend and expand’ subroutine encounters an expanded node or an inline cache node, then the best child node is removed from the queue data structure, and the subroutine recursively calls itself for that child node. If the node returned from the recursive call is not null or has a non-empty queue, then the child node is placed back on the expansion queue of the current node. Before returning the current node, the update metric subroutine updates the metrics field, which contains various information about the relevant subtree of the call graph, including, but not limited to, the total program size of all the call graph nodes in that subtree or the number of cutoff nodes in that subtree. If, instead, the current node is a cutoff node (i.e., a leaf in the tree), then the expand subroutine is called on the policy object.
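The descend-and-expand step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the `Node` class, the one-letter `kind` labels, the policy callback signature, and the `cutoffs_below` metric are hypothetical; the priority is taken as P = B/C; the policy is assumed to expand a cutoff node in place and return it (or `None` to skip it this round); and a returned node is re-queued only when it is non-null and still has queued children, which is one reading of the re-queue condition above.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Callable, List, Optional

_seq = itertools.count()  # tie-breaker so heapq never has to compare Node objects


@dataclass
class Node:
    name: str
    kind: str                       # "C", "E", "I", "G" or "D" (FIG. 2 labels)
    benefit: float = 1.0            # B: benefit of inlining only this node
    cost: float = 1.0               # C: code size increase from inlining it
    children: List["Node"] = field(default_factory=list)
    queue: list = field(default_factory=list)   # max-heap via negated priority
    cutoffs_below: int = 0          # example metric: cutoff nodes in the subtree


def priority(node: Node) -> float:
    # P = B / C; a zero-cost node is treated as maximally attractive.
    return node.benefit / node.cost if node.cost > 0 else float("inf")


def enqueue(parent: Node, child: Node) -> None:
    heapq.heappush(parent.queue, (-priority(child), next(_seq), child))


def update_metrics(node: Node) -> None:
    node.cutoffs_below = sum(
        1 if c.kind == "C" else c.cutoffs_below for c in node.children)


Policy = Callable[[Node], Optional[Node]]


def descend_and_expand(node: Node, expand: Policy) -> Optional[Node]:
    if node.kind == "C":
        # Leaf of the explored region: let the policy expand it (may return None).
        return expand(node)
    if node.kind in ("E", "I") and node.queue:
        _, _, child = heapq.heappop(node.queue)   # highest-priority child
        result = descend_and_expand(child, expand)
        if result is not None and result.queue:   # still has work below it
            enqueue(node, result)
        update_metrics(node)
        return node
    return None                                   # G, D, or nothing left to do


def expand_call_graph(root: Node, expand: Policy,
                      is_done: Callable[[Node], bool]) -> None:
    for child in root.children:
        enqueue(root, child)
    while root.queue and not is_done(root):
        descend_and_expand(root, expand)
```

Using a binary heap keeps each pop at O(log n) in the number of queued children, and the sequence counter breaks ties so that nodes themselves are never compared.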

In one or more embodiments, the expand subroutine may return either null (indicating that the respective cutoff should not be considered in this round) or a generic, expanded, or inline cache node. In one or more embodiments, the expansion of the call graph begins at the request of a user of a computing device. In one or more embodiments, the expansion of the call graph begins as a part of scheduled functionality of a computing device. In one or more embodiments, the expansion of the call graph of the program begins as a result of being invoked by other software running on a computing device.

Step 302 analyzes the expanded call graph to select a child method of the program. In particular, Step 302 analyzes the expanded call graph to identify groups of methods in the call graph that should be inlined simultaneously. Here, simultaneously means at the same time, at overlapping times, or immediately one after the other. Each group of methods is assigned a benefit value and a cost value. In one or more embodiments, the analysis of the expanded call graph is designed to be executable by the compiler.

Step 303 inlines a child method into a root method of the program. In one or more embodiments, several groups of methods are inlined into the root method of the program in Step 303. A group of methods is a set of methods whose inlining improves program performance only if the methods in the set are inlined together; a group is therefore inlined either entirely (if there is sufficient budget remaining) or not at all. In one or more embodiments, the inlining of a child method is designed to be executable by the compiler.

Step 304 performs an optimization operation in response to inlining the one or more child methods into the root method. In one or more embodiments, the optimization operation performed in response to inlining the child method into the root method is designed to be executable by the compiler.

Step 305 updates the expanded call graph based on the optimization operation. In one or more embodiments, the update of the expanded call graph based on the optimization operation is designed to be executable by the compiler.

Step 306 checks whether the termination condition is satisfied. In one or more embodiments, if the termination condition is satisfied, the process continues to Step 307. In one or more embodiments, if the termination condition is not satisfied, the process returns to Step 301.

Step 307 completes the optimization of the program. In one or more embodiments, completion of the optimization of the program is designed to be executable by the compiler.
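The alternation of Steps 301 to 307 can be sketched as a simple driver loop. The function name `run_inlining_rounds`, its callback parameters, and the `max_rounds` safety cap are illustrative assumptions; each callback stands in for the corresponding step.

```python
from typing import Callable

Step = Callable[[], None]


def run_inlining_rounds(expand: Step, analyze: Step, inline: Step,
                        optimize: Step, update: Step,
                        terminated: Callable[[], bool],
                        max_rounds: int = 100) -> int:
    """Alternate expansion and inlining until the termination condition holds.

    Returns the number of rounds performed. max_rounds is a safety cap added
    for this sketch; it is not part of the described method.
    """
    round_no = 0
    for round_no in range(1, max_rounds + 1):
        expand()           # Step 301: expand the call graph
        analyze()          # Step 302: cost-benefit analysis of the expanded graph
        inline()           # Step 303: inline the selected child methods
        optimize()         # Step 304: optimize in response to inlining
        update()           # Step 305: update the expanded call graph
        if terminated():   # Step 306: check the termination condition
            break
    return round_no        # Step 307: optimization complete
```

Keeping each step behind a callback makes the incremental structure explicit: every round sees the call graph as updated by the previous round's optimizations.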

FIG. 4 shows the expansion part of one or more embodiments of the invention. Any of the steps shown in FIG. 4 may be designed to be executed by the compiler. Initially, Step 401 initializes the priority queue for expansion of the call graph. The initial priority queue value is a function of the initial benefit and the cost (i.e., the code size).

Step 402 determines whether the expansion is completed. In one or more embodiments, if the expansion is completed, the process proceeds to the END. The expansion is completed either when there are no more cutoff nodes to expand, or according to a heuristic. A heuristic can be, but is not limited to, checking whether the benefit-per-cost ratio of the cutoff node exceeds the value $e^{(\text{root-size} - C_1)/C_2}$, where root-size is the size of the root method, and $C_1$ and $C_2$ are empirically derived constants. In one or more embodiments, if the expansion is not completed, the process proceeds to Step 403, which starts the descent into the call graph. Step 403 marks the root node as the current node.
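The completion check of Step 402 can be sketched as below, assuming that the expansion stops once no cutoff node remains or the best cutoff node's benefit-per-cost ratio no longer exceeds $e^{(\text{root-size} - C_1)/C_2}$. The function names are hypothetical, and any values passed for C1 and C2 are placeholders rather than the empirically derived constants.

```python
import math
from typing import Iterable


def size_threshold(root_size: float, c1: float, c2: float) -> float:
    # e^((root-size - C1) / C2): grows exponentially as the root method grows.
    return math.exp((root_size - c1) / c2)


def expansion_done(cutoff_ratios: Iterable[float], root_size: float,
                   c1: float, c2: float) -> bool:
    """Return True when no cutoff node remains or none is worth expanding."""
    ratios = list(cutoff_ratios)
    if not ratios:
        return True                # no more cutoff nodes to expand
    return max(ratios) <= size_threshold(root_size, c1, c2)
```

For example, when root-size equals C1 the threshold is exactly 1, so a cutoff node with a ratio of 2.0 would still be expanded; as the root method grows, the same node eventually stops qualifying.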

Step 404 checks whether the node is of type explored or inlined. In one or more embodiments, if the node is of type explored or inlined, the process proceeds to Step 405. In one or more embodiments, if the node is not of type explored or inlined, the process proceeds to Step 406.

In one or more embodiments, Step 405 sets the new current node to the child of the current node with the greatest expansion priority value. Upon completion of Step 405, the process proceeds back to Step 404.

In one or more embodiments, for nodes of type C, G, D, and E, the benefit value is calculated as a function of the frequency with which a method is called by the root method and of the number of optimizations triggered by the improved call-site arguments (which is determined by the expansion policy); for nodes of type I, it is calculated as a function of the probability of the respective child and the local benefit value. The benefit can be estimated with, but is not limited to, the expression $f \cdot (1 + N_s)$, where $f$ is the frequency with which the cutoff node is called in the program, and $N_s$ is the number of its parameters that can potentially trigger optimizations after inlining. The cost value is calculated as a function of the bytecode size for nodes of type C; it is infinite for nodes of type G, 0 for nodes of type D, the size of the intermediate representation for nodes of type E, and the sum of the cost values of the children of the root node for nodes of type I.
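The benefit and cost rules above can be written as small functions. This sketch assumes the $f \cdot (1 + N_s)$ estimate for the local benefit, models the inline cache (I) benefit as a probability-weighted sum of the targets' benefits (one possible choice of a function of probability and local benefit), and reads the I-node cost as the sum of that node's children's costs. All function names are hypothetical.

```python
import math
from typing import Iterable, Tuple


def local_benefit(frequency: float, optimizable_params: int) -> float:
    # B = f * (1 + Ns): call frequency times (1 + parameters that may enable optimizations).
    return frequency * (1 + optimizable_params)


def inline_cache_benefit(targets: Iterable[Tuple[float, float]]) -> float:
    # Assumed model: sum over targets of (dispatch probability * local benefit).
    return sum(p * b for p, b in targets)


def node_cost(kind: str, bytecode_size: float = 0, ir_size: float = 0,
              children_costs: Iterable[float] = ()) -> float:
    if kind == "C":
        return float(bytecode_size)         # unexplored: estimate from bytecode
    if kind == "G":
        return math.inf                     # never inlined
    if kind == "D":
        return 0.0                          # call-site already removed
    if kind == "E":
        return float(ir_size)               # explored: size of the IR
    if kind == "I":
        return float(sum(children_costs))   # sum over the dispatch targets
    raise ValueError(f"unknown node kind: {kind!r}")
```

An infinite cost for generic nodes means any benefit-per-cost ratio computed for them is zero or lower, so they never outrank a finite-cost candidate.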

Step 406 replaces the node with its expansion. Step 407 records the optimization; in one or more embodiments, Step 407 records the optimizations triggered in the call graph by expanding the cutoff node. Finally, Step 408 updates the priority queue.

FIG. 5 shows the analysis part of one or more embodiments of the invention. Any of the steps shown in FIG. 5 may be designed to be executed by the compiler. Initially, Step 501 sets the worklist to the nodes of the call graph ordered from bottom to top. During the analysis part of one or more embodiments, nodes are assigned cost-benefit tuples. The merged cost-benefit tuple models the cumulative benefit and cost obtained by inlining one call-site into another. The analysis is done in the ‘analyze’ subroutine and proceeds bottom-up: first, the child nodes are analyzed. After these calls complete, the following invariants hold for each child node m: (1) some connected subgraph B below m has its nodes' inlined flag set to true, which indicates that, if m were the root method, these descendants would be inlined into m; and (2) the tuple in m is set to the benefit and cost of inlining the subgraph B into m. The subgraph B is heuristically chosen such that inlining it maximally improves the benefit per cost of the method m. Inlining only some subset of the subgraph B may improve the benefit per cost less, or even decrease it. More details regarding the analysis are shown in FIG. 7G and FIG. 7H.

Step 502 checks whether the worklist is empty. In one or more embodiments, if the worklist is empty, then the process proceeds to END. Otherwise, the process proceeds to Step 503, which selects the current node from the worklist.

Step 504 calculates the inlining priority value of the current node. Step 505 creates a list of descendants of the current node. Child nodes are put in a list, from which the child nodes with the highest benefit-cost ratio are repeatedly removed in a loop, while the other children are left in the list.

Step 506 calculates the cost value and the benefit value for inlining each child node in the list of descendants. The cost value is calculated as a function of the bytecode size for nodes of type C; it is infinite for nodes of type G, 0 for nodes of type D, the size of the intermediate representation for nodes of type E, and the sum of the cost values of the children of the root node for nodes of type I. For nodes of type C, G, D, and E, the benefit value is calculated as a function of the frequency with which a method is called by the root method and of the number of optimizations triggered by the improved call-site arguments, which is determined by the expansion policy; for nodes of type I, it is a function of the probability of the respective child and the local benefit value.

Step 507 calculates an inlining priority value as a function of the cost value and the benefit value. Step 508 selects the child node having the greatest inlining priority value. Such use of priority values based on the cost value and the benefit value is important to one or more embodiments of the invention.

In Step 509, the inlining priority value of the child node and the inlining priority value of the current node are checked to determine whether the criteria are satisfied. In one or more embodiments, if these inlining priority values satisfy the criteria, then the process proceeds to Step 510. In one or more embodiments, if these inlining priority values do not satisfy the criteria, then the process proceeds to Step 502.

In one or more embodiments, Step 510 removes the child node from the descendant list, marks the child node to be inlined, and adds the children of the child node to the descendant list.

Step 511 checks whether the descendant list is empty. In one or more embodiments, if the descendant list is empty, then the process proceeds to Step 502. In one or more embodiments, if the descendant list is not empty, then the process proceeds to Step 508. The inlining priority value is calculated as a function of the local benefit, the cost of inlining the node, and a reduced priority penalty. The local benefit and the cost of inlining the node are calculated as described above. The priority penalty is a function of the size of the intermediate representation of the nodes, the size of the bytecode, and several empirically determined constants.
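A simplified sketch of the bottom-up analysis of FIG. 5 follows. It assumes that each node carries a single (benefit, cost) pair, that merging a child adds the child's pair to the current tuple, and that the criterion of Step 509 is "the merged tuple has a strictly higher benefit-per-cost ratio than the current one"; the priority penalty of Step 511 is not modeled. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Result = Tuple[Tuple[float, float], Set[str]]   # ((benefit, cost), inlined names)


@dataclass
class CGNode:
    name: str
    benefit: float                  # benefit contributed when this node is inlined
    cost: float                     # cost added when this node is inlined
    children: List["CGNode"] = field(default_factory=list)


def ratio(benefit: float, cost: float) -> float:
    if math.isinf(cost):
        return -math.inf            # infinite cost (generic node): never worth it
    if cost == 0:
        return math.inf if benefit > 0 else (0.0 if benefit == 0 else -math.inf)
    return benefit / cost


def analyze_node(node: CGNode) -> Result:
    benefit, cost = node.benefit, node.cost   # starting tuple of the current node
    descendants = list(node.children)         # Step 505
    inlined: Set[str] = set()
    while descendants:                        # Step 511: stop when the list is empty
        # Steps 506-508: pick the candidate with the best benefit-per-cost ratio.
        i = max(range(len(descendants)),
                key=lambda k: ratio(descendants[k].benefit, descendants[k].cost))
        best = descendants[i]
        merged_b, merged_c = benefit + best.benefit, cost + best.cost
        if ratio(merged_b, merged_c) <= ratio(benefit, cost):
            break                             # Step 509: criterion not satisfied
        descendants.pop(i)                    # Step 510: mark and expose children
        inlined.add(best.name)
        descendants.extend(best.children)
        benefit, cost = merged_b, merged_c
    return (benefit, cost), inlined


def analyze_bottom_up(root: CGNode) -> Dict[str, Result]:
    order: List[CGNode] = []

    def post_order(n: CGNode) -> None:
        for c in n.children:
            post_order(c)
        order.append(n)

    post_order(root)                          # Step 501: worklist from bottom to top
    return {n.name: analyze_node(n) for n in order}
```

Treating an infinite cost as a ratio of minus infinity keeps generic nodes from ever passing the merge check, matching the statement that such nodes are not considered for inlining.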

FIG. 6 shows the inlining part of one or more embodiments of the invention. Any of the steps shown in FIG. 6 may be designed to be executed by the compiler. In Step 601, the queue is initialized to include the children of the root method.

Step 602 checks whether the queue is empty. In one or more embodiments, if the queue is empty, then the process proceeds to Step 609. In one or more embodiments, if the queue is not empty, then the process proceeds to Step 603, which selects a node from the queue.

Step 604 computes a cost value and a benefit value for inlining a method. In one or more embodiments, the cost of expanded nodes is based on the sum of the costs of the children that were previously marked as inlined during the analysis part. Similarly, in one or more embodiments, the benefit of expanded nodes is based on the sum of the benefits of inlining the children that were previously marked as inlined during the analysis part. Combining inlining and expansion in this manner is an important improvement: its goal is to model the inlining decisions that each call graph node would make if it were the root compilation unit, and thereby to decide whether it is better to inline those methods into the call-site or to compile them separately. The iterative alternation of expansion and inlining is likewise an important improvement. In one or more embodiments, the cost of inlined nodes is based on the size of the intermediate representation for the expanded nodes.

Step 605 computes an inlining priority value as a function of the cost value and the benefit value. Step 606 computes the cost value of the root method based on the size of the method.

Step 607 calculates the dynamic threshold based on the size of the root method and the explored part of the call graph. The use of a dynamic threshold to process nodes in the call graph is an important improvement. In Step 608, the dynamic threshold is evaluated to determine whether it is satisfied. The dynamic threshold can be computed as, but is not limited to, the value $e^{(\text{root-size} - C_1)/C_2}$, where root-size is the size of the root method, and $C_1$ and $C_2$ are empirically derived constants. In one or more embodiments of the invention, if the dynamic threshold is satisfied, then the process proceeds to END. In one or more embodiments, if the dynamic threshold is not satisfied, then the process proceeds to Step 610.

Step 609 applies loop peeling and escape analysis. Finally, Step 610 processes the child nodes.
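The inlining stage of FIG. 6 can be sketched as a best-first loop over a priority queue seeded with the root's children. This is a minimal sketch assuming that the loop ends (the END branch of Step 608) once the best candidate's benefit-per-cost ratio falls below $e^{(\text{root-size} - C_1)/C_2}$; the names, the plain B/C priority, and the constants are hypothetical, and neither the aggregation over previously marked children in Step 604 nor the loop peeling and escape analysis of Step 609 is modeled.

```python
import heapq
import itertools
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Candidate:
    name: str
    benefit: float
    cost: float
    children: List["Candidate"] = field(default_factory=list)


def dynamic_threshold(root_size: float, c1: float, c2: float) -> float:
    # e^((root-size - C1) / C2): the bar rises as the root method grows.
    return math.exp((root_size - c1) / c2)


def priority(c: Candidate) -> float:
    if math.isinf(c.cost):
        return -math.inf                 # generic nodes are never inlined
    return c.benefit / c.cost if c.cost > 0 else math.inf


def inline_round(root_size: float, children: List[Candidate],
                 c1: float, c2: float) -> Tuple[List[str], float]:
    seq = itertools.count()              # tie-breaker for equal priorities
    queue = [(-priority(c), next(seq), c) for c in children]   # Step 601
    heapq.heapify(queue)
    inlined: List[str] = []
    while queue:                                                 # Step 602
        neg_p, _, cand = heapq.heappop(queue)                    # Step 603
        if -neg_p < dynamic_threshold(root_size, c1, c2):        # Steps 607-608
            break
        inlined.append(cand.name)        # inline the candidate into the root
        root_size += cand.cost           # the root grows, so the threshold rises
        for child in cand.children:      # Step 610: its children become candidates
            heapq.heappush(queue, (-priority(child), next(seq), child))
    return inlined, root_size
```

Because `root_size` grows with every inlined candidate, a candidate that would clear the bar early in a round may be rejected later, which is the budget-driven behavior of the procedure.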

FIGS. 7A, 7B, 7C, 7D, 7E, 7F, 7G, and 7H show examples of calculatingpriorities as well as managing the call graph data structure.

FIG. 7A shows example source code (700) that is analyzed in the example shown in FIGS. 7B, 7C, 7D, 7E, 7F, 7G, and 7H. In particular, lines 1-3 of the example source code (700) are code for the half function, lines 5-14 are code for the collatz function, and lines 16-20 are code for the main function. FIG. 7B shows, in one or more embodiments of the invention, the initial call graph for a main function (711), labeled E since it is an explored node, representing a call to a function whose body has been explored. The main function calls the collatz and error functions. In one or more embodiments, before inlining starts, the methods collatz (712) and error (713) are labeled C for cutoff nodes, since they are functions whose bodies have not been explored yet. The arrows pointing to main show that collatz (712) and error (713) are the child nodes of main (711).

FIG. 7C shows, in one or more embodiments, the call graph structure after main (721), collatz (712), and error (713) from FIG. 7B have been explored. In FIG. 7C, in one or more embodiments, exploration has generated another collatz (724) method, a half (725) method, and a second collatz (726) method. In one or more embodiments, the old collatz (722) method has been labeled E since the method has been explored. The error (723) method has been labeled G for generic: since the inlining procedure could not determine a concrete target, the method was replaced with this generic node. In one or more embodiments, the new nodes, namely the two collatz methods (724, 726) and the half (725) method, are labeled C for cutoff nodes because they are functions whose bodies have not been explored yet. The call graph structure is built up in this way, and later the cost-benefit analysis is applied to determine how to improve the call graph.

FIG. 7D shows, in one or more embodiments, the outcome after the inlining procedure has been applied to FIG. 7C, starting with main (731). In one or more embodiments, a deleted node, denoted D, represents a call-site that was originally in the intermediate representation but was removed by an optimization. Continuing with FIG. 7D, in one or more embodiments, the inlining algorithm propagates the concrete call-site arguments into the body of the first collatz (732) call, which triggers an optimization. Consequently, in one or more embodiments, the if-cascade in the collatz (736) body is optimized away, and the first recursive call to collatz (734) is removed. The error (733) node is labeled G, and half (735) is labeled as cutoff C, as is the second collatz (736).

FIG. 7E shows, in one or more embodiments, the state after the inlining procedure has been applied to FIG. 7D. In one or more embodiments of the invention, several more nodes (744, 747, 748, 749) have been labeled as deleted nodes, denoted D, while the explored nodes (741, 742, 745, 746) have been labeled E. The error node (743) has been labeled generic, G. Although more nodes are generated to explore, several of the nodes (744, 747, 748, 749) will be deleted. This is an example of how opportunities for call graph expansion are found that were not known to exist at the beginning of the process. Accordingly, more opportunities for optimization exist.

FIG. 7F shows that, in one or more embodiments of the invention, the compiler may alternatively conclude that the only implementations of the error method (753) are in the StdLog (754) and FileLog (755) classes, create an inline cache node with the respective children, label those methods C (cutoff), and label the error node (753) I (inline cache). The main node (751) is labeled E (explored), while the collatz node (752) is labeled C (cutoff).

FIGS. 7G and 7H show, in one or more embodiments, the analysis part of the invention. The currently considered node is main (761) at depth 0, marked with an arrow. Each child node at depth 1 is analyzed: the collatz (762) call on the left has the benefit|cost values 2|4, and the generic error call on the right has the benefit|cost values 1|infinity. In addition, the analysis concludes that the subgraph B of collatz (762) (namely, the nodes half (765) and collatz (764, 766) at depth 2) must be inlined together, because the arguments from the first collatz (762) considerably simplify the half node (765) and the second collatz node (766). The child node error (763) is labeled generic, G, with B_L|C values of 1|infinity. The child nodes collatz (767), half (777), and collatz (778) are labeled for deletion, D, with B_L|C values of 1|0. The analysis part of the invention allows a decision to be made, based on the cost-benefit calculations, about which nodes to keep and which to delete.

The initial benefit B_I of a node n is calculated using the local benefit of n and the benefits of the child nodes present. The initial benefit B_I models the benefit of inlining n alone, reflecting the fact that no benefits from inlining the children of n have yet occurred. For most nodes, the initial benefit B_I is a negative value. For example, B_I for the main method in FIG. 7G is B_I = 1 − 2 − 1 = −2.

FIGS. 7G and 7H show, in one or more embodiments, the values of the analysis of some methods. The initial cost-benefit tuple is B_I(main)|C(main) = −2|5. The best child is collatz (762), with the tuple 2|4. The merged tuple is 0|9, which is better than −2|5 because its benefit-per-cost ratio, 0/9 = 0, exceeds −2/5 = −0.4; the collatz (772) call is therefore marked for inlining (see FIG. 7H). In the inlining stage, the collatz (774, 776) and half (775) calls at depth 2 are also inlined into main (771), since the calls are part of the marked connected subgraph. The error (773) call is in this case generic and cannot improve main further.
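The arithmetic of FIGS. 7G and 7H can be checked with a few lines of Python. The sketch assumes, consistently with the numbers above, that B_I is the local benefit minus the children's benefits and that merging two tuples adds their benefits and their costs. The function names are hypothetical, and an infinite cost is treated as a ratio of minus infinity so that the generic error call can never improve a tuple.

```python
import math
from typing import Iterable, Tuple

CB = Tuple[float, float]   # (benefit, cost)


def initial_benefit(local: float, child_benefits: Iterable[float]) -> float:
    # B_I = B_L minus the benefits of the children that are not yet inlined.
    return local - sum(child_benefits)


def merge(parent: CB, child: CB) -> CB:
    # Cumulative benefit and cost of inlining the child into the parent.
    return parent[0] + child[0], parent[1] + child[1]


def per_cost(t: CB) -> float:
    return -math.inf if math.isinf(t[1]) else t[0] / t[1]


def improves(parent: CB, child: CB) -> bool:
    return per_cost(merge(parent, child)) > per_cost(parent)


main = (initial_benefit(1, [2, 1]), 5)    # B_I = 1 - 2 - 1 = -2
collatz = (2, 4)
error = (1, math.inf)
print(main, merge(main, collatz), improves(main, collatz))
# prints (-2, 5) (0, 9) True
```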

Embodiments of the invention may be implemented on a computing system. Any combination of mobile, desktop, server, router, switch, embedded device, or other types of hardware may be used. For example, as shown in FIG. 8A, the computing system (800) may include one or more computer processors (802), non-persistent storage (804) (e.g., volatile memory, such as random access memory (RAM), cache memory), persistent storage (806) (e.g., a hard disk, an optical drive such as a compact disk (CD) drive or digital versatile disk (DVD) drive, a flash memory, etc.), a communication interface (812) (e.g., Bluetooth interface, infrared interface, network interface, optical interface, etc.), and numerous other elements and functionalities.

The computer processor(s) (802) may be an integrated circuit for processing instructions. For example, the computer processor(s) may be one or more cores or micro-cores of a processor. The computing system (800) may also include one or more input devices (810), such as a touchscreen, keyboard, mouse, microphone, touchpad, electronic pen, or any other type of input device.

The communication interface (812) may include an integrated circuit for connecting the computing system (800) to a network (not shown) (e.g., a local area network (LAN), a wide area network (WAN) such as the Internet, mobile network, or any other type of network) and/or to another device, such as another computing device.

Further, the computing system (800) may include one or more output devices (808), such as a screen (e.g., a liquid crystal display (LCD), a plasma display, touchscreen, cathode ray tube (CRT) monitor, projector, or other display device), a printer, external storage, or any other output device. One or more of the output devices may be the same as or different from the input device(s). The input and output device(s) may be locally or remotely connected to the computer processor(s) (802), non-persistent storage (804), and persistent storage (806). Many different types of computing systems exist, and the aforementioned input and output device(s) may take other forms.

Software instructions in the form of computer readable program code to perform embodiments of the invention may be stored, in whole or in part, temporarily or permanently, on a non-transitory computer readable medium such as a CD, DVD, storage device, a diskette, a tape, flash memory, physical memory, or any other computer readable storage medium. Specifically, the software instructions may correspond to computer readable program code that, when executed by a processor(s), is configured to perform one or more embodiments of the invention.

The computing system (800) in FIG. 8A may be connected to or be a part of a network. For example, as shown in FIG. 8B, the network (820) may include multiple nodes (e.g., node X (822), node Y (824)). Each node may correspond to a computing system, such as the computing system shown in FIG. 8A, or a group of nodes combined may correspond to the computing system shown in FIG. 8A. By way of an example, embodiments of the invention may be implemented on a node of a distributed system that is connected to other nodes. By way of another example, embodiments of the invention may be implemented on a distributed computing system having multiple nodes, where each portion of the invention may be located on a different node within the distributed computing system. Further, one or more elements of the aforementioned computing system (800) may be located at a remote location and connected to the other elements over a network.

Although not shown in FIG. 8B, the node may correspond to a blade in a server chassis that is connected to other nodes via a backplane. By way of another example, the node may correspond to a server in a data center. By way of another example, the node may correspond to a computer processor or micro-core of a computer processor with shared memory and/or resources.

The nodes (e.g., node X (822), node Y (824)) in the network (820) may be configured to provide services for a client device (826). For example, the nodes may be part of a cloud computing system. The nodes may include functionality to receive requests from the client device (826) and transmit responses to the client device (826). The client device (826) may be a computing system, such as the computing system shown in FIG. 8A. Further, the client device (826) may include and/or perform all or a portion of one or more embodiments of the invention.

The computing system or group of computing systems described in FIGS. 8A and 8B may include functionality to perform a variety of operations disclosed herein. For example, the computing system(s) may perform communication between processes on the same or different system. A variety of mechanisms, employing some form of active or passive communication, may facilitate the exchange of data between processes on the same device. Examples representative of these inter-process communications include, but are not limited to, the implementation of a file, a signal, a socket, a message queue, a pipeline, a semaphore, shared memory, message passing, and a memory-mapped file. Further details pertaining to a couple of these non-limiting examples are provided below.

Based on the client-server networking model, sockets may serve as interfaces or communication channel end-points enabling bidirectional data transfer between processes on the same device. First, following the client-server networking model, a server process (e.g., a process that provides data) may create a first socket object. Next, the server process binds the first socket object, thereby associating the first socket object with a unique name and/or address. After creating and binding the first socket object, the server process then waits and listens for incoming connection requests from one or more client processes (e.g., processes that seek data). At this point, when a client process wishes to obtain data from a server process, the client process starts by creating a second socket object. The client process then proceeds to generate a connection request that includes at least the second socket object and the unique name and/or address associated with the first socket object. The client process then transmits the connection request to the server process. Depending on availability, the server process may accept the connection request, establishing a communication channel with the client process, or the server process, busy handling other operations, may queue the connection request in a buffer until the server process is ready. An established connection informs the client process that communications may commence. In response, the client process may generate a data request specifying the data that the client process wishes to obtain. The data request is subsequently transmitted to the server process. Upon receiving the data request, the server process analyzes the request and gathers the requested data. Finally, the server process generates a reply including at least the requested data and transmits the reply to the client process. More commonly, the data may be transferred as datagrams or a stream of characters (e.g., bytes).
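The socket exchange described above can be sketched with Python's standard `socket` module. To keep the example self-contained, a thread stands in for the server process and the main thread for the client process, both on the loopback interface. The request and reply payloads are hypothetical, and the client half-closes its socket so the server knows the request is complete.

```python
import socket
import threading


def recv_all(sock: socket.socket) -> bytes:
    """Read until the peer signals end of stream (recv returns b"")."""
    chunks = []
    while True:
        chunk = sock.recv(1024)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)


def serve_once(server_sock: socket.socket) -> None:
    """Server side: accept one connection request, read the data request, reply."""
    conn, _addr = server_sock.accept()      # blocks until a queued request exists
    with conn:
        request = recv_all(conn)            # the client's data request
        conn.sendall(b"reply:" + request)   # reply that includes the requested data


# Server: create the first socket object, bind it to an address, and listen.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))               # port 0 lets the OS choose a free port
server.listen()
address = server.getsockname()              # the address clients connect to

worker = threading.Thread(target=serve_once, args=(server,))
worker.start()

# Client: create the second socket object and send a connection request.
with socket.create_connection(address) as client:
    client.sendall(b"item-42")              # data request
    client.shutdown(socket.SHUT_WR)         # signal that the request is complete
    reply = recv_all(client)                # read until the server closes

worker.join()
server.close()
print(reply)  # b'reply:item-42'
```

Because `listen()` is called before the client connects, the connection request waits in the listen backlog until `accept()` runs, which mirrors the queuing behavior described above.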

Shared memory refers to the allocation of virtual memory space in order to provide a mechanism through which data may be communicated and/or accessed by multiple processes. In implementing shared memory, an initializing process first creates a shareable segment in persistent or non-persistent storage. After creation, the initializing process mounts the shareable segment, subsequently mapping the shareable segment into the address space associated with the initializing process. Following the mounting, the initializing process proceeds to identify and grant access permission to one or more authorized processes that may also write and read data to and from the shareable segment. Changes made to the data in the shareable segment by one process may immediately affect other processes that are also linked to the shareable segment. Further, when one of the authorized processes accesses the shareable segment, the shareable segment maps to the address space of that authorized process. Often, only one authorized process, other than the initializing process, may mount the shareable segment at any given time.
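A comparable sketch of shared memory uses Python's `multiprocessing.shared_memory` module (Python 3.8 or later). Both roles run in one process here for brevity; in practice the segment name would be passed to a separate authorized process, which attaches to the segment by that name. The payload bytes are hypothetical.

```python
from multiprocessing import shared_memory

# Initializing process: create a shareable segment (at least 16 bytes).
segment = shared_memory.SharedMemory(create=True, size=16)
try:
    segment.buf[:5] = b"hello"               # write through the creator's mapping

    # An authorized process attaches to the same segment by its name.
    attached = shared_memory.SharedMemory(name=segment.name)
    try:
        before = bytes(attached.buf[:5])     # sees the creator's write
        attached.buf[:1] = b"j"              # write through the second mapping
    finally:
        attached.close()                     # detach; the segment itself persists

    after = bytes(segment.buf[:5])           # the change is visible to the creator
finally:
    segment.close()
    segment.unlink()                         # destroy the segment

print(before, after)  # b'hello' b'jello'
```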

Other techniques may be used to share data, such as the various data described in the present application, between processes without departing from the scope of the invention.

By way of another example, a request to obtain data regarding the particular item may be sent to a server operatively connected to the user device through a network. For example, the user may select a uniform resource locator (URL) link within a web client of the user device, thereby initiating a Hypertext Transfer Protocol (HTTP) or other protocol request being sent to the network host associated with the URL. In response to the request, the server may extract the data regarding the particular selected item and send the data to the device that initiated the request. Once the user device has received the data regarding the particular item, the contents of the received data regarding the particular item may be displayed on the user device in response to the user's selection. Further to the above example, the data received from the server after selecting the URL link may provide a web page in HyperText Markup Language (HTML) that may be rendered by the web client and displayed on the user device.

Once data is obtained, such as by using techniques described above or from storage, the computing system, in performing one or more embodiments of the invention, may extract one or more data items from the obtained data. For example, the extraction may be performed as follows by the computing system in FIG. 8A. First, the organizing pattern (e.g., grammar, schema, layout) of the data is determined, which may be based on one or more of the following: position (e.g., bit or column position, Nth token in a data stream, etc.), attribute (where the attribute is associated with one or more values), or a hierarchical/tree structure (consisting of layers of nodes at different levels of detail, such as in nested packet headers or nested document sections). Then, the raw, unprocessed stream of data symbols is parsed, in the context of the organizing pattern, into a stream (or layered structure) of tokens (where each token may have an associated token “type”).

Next, extraction criteria are used to extract one or more data items from the token stream or structure, where the extraction criteria are processed according to the organizing pattern to extract one or more tokens (or nodes from a layered structure). For position-based data, the token(s) at the position(s) identified by the extraction criteria are extracted. For attribute/value-based data, the token(s) and/or node(s) associated with the attribute(s) satisfying the extraction criteria are extracted. For hierarchical/layered data, the token(s) associated with the node(s) matching the extraction criteria are extracted. The extraction criteria may be as simple as an identifier string or may be a query presented to a structured data repository (where the data repository may be organized according to a database schema or data format, such as XML).
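The three extraction styles can be illustrated with a short Python sketch using only the standard library: a whitespace-tokenized record for position-based data, `key=value` pairs for attribute/value-based data, and an XML document queried with `xml.etree.ElementTree` for hierarchical data. The records and element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Position-based: take the Nth whitespace-separated token of a record.
record = "main collatz 2 4"
tokens = record.split()
callee = tokens[1]                              # criterion: position 1

# Attribute/value-based: keep the value whose attribute matches the criterion.
pairs = dict(item.split("=", 1) for item in "method=collatz depth=1 cost=4".split())
cost = pairs["cost"]                            # criterion: attribute "cost"

# Hierarchical: select nodes matching a path query in a nested document.
doc = ET.fromstring(
    "<graph><node name='main'><node name='collatz'/><node name='error'/></node></graph>"
)
names = [n.get("name") for n in doc.findall("./node[@name='main']/node")]

print(callee, cost, names)  # collatz 4 ['collatz', 'error']
```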

The extracted data may be used for further processing by the computing system. For example, the computing system of FIG. 8A, while performing one or more embodiments of the invention, may perform data comparison. Data comparison may be used to compare two or more data values (e.g., A, B). For example, one or more embodiments may determine whether A > B, A = B, A != B, A < B, etc. The comparison may be performed by submitting A, B, and an opcode specifying an operation related to the comparison into an arithmetic logic unit (ALU) (i.e., circuitry that performs arithmetic and/or bitwise logical operations on the two data values). The ALU outputs the numerical result of the operation and/or one or more status flags related to the numerical result. For example, the status flags may indicate whether the numerical result is a positive number, a negative number, zero, etc. By selecting the proper opcode and then reading the numerical results and/or status flags, the comparison may be executed. For example, in order to determine if A > B, B may be subtracted from A (i.e., A − B), and the status flags may be read to determine if the result is positive (i.e., if A > B, then A − B > 0). In one or more embodiments, B may be considered a threshold, and A is deemed to satisfy the threshold if A = B or if A > B, as determined using the ALU. In one or more embodiments of the invention, A and B may be vectors, and comparing A with B requires comparing the first element of vector A with the first element of vector B, the second element of vector A with the second element of vector B, etc. In one or more embodiments, if A and B are strings, the binary values of the strings may be compared.
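The comparison logic can be mirrored in Python, with the caveat that Python evaluates these expressions in software rather than by exposing ALU status flags; testing the sign of `a - b` plays the role of reading those flags. The function names are hypothetical, and strings are compared by their UTF-8 byte values as one possible reading of "binary values".

```python
from typing import List, Sequence


def satisfies_threshold(a: int, b: int) -> bool:
    """A satisfies threshold B when A == B or A > B, decided from the sign
    of A - B, analogous to reading status flags after a subtraction."""
    diff = a - b                 # the "subtract B from A" step
    return diff >= 0             # zero result or positive result


def compare_vectors(a: Sequence[int], b: Sequence[int]) -> List[bool]:
    """Compare two equal-length vectors element by element."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x == y for x, y in zip(a, b)]


def compare_strings(a: str, b: str) -> int:
    """Compare strings by their binary (UTF-8 byte) values: -1, 0, or 1."""
    ab, bb = a.encode("utf-8"), b.encode("utf-8")
    return (ab > bb) - (ab < bb)


print(satisfies_threshold(7, 5), satisfies_threshold(5, 5), satisfies_threshold(3, 5))
# True True False
print(compare_vectors([1, 2, 3], [1, 0, 3]))  # [True, False, True]
print(compare_strings("abc", "abd"))          # -1
```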

The computing system in FIG. 8A may implement and/or be connected to a data repository. For example, one type of data repository is a database. A database is a collection of information configured for ease of data retrieval, modification, re-organization, and deletion. A database management system (DBMS) is a software application that provides an interface for users to define, create, query, update, or administer databases.

The user, or software application, may submit a statement or query into the DBMS. Then the DBMS interprets the statement. The statement may be a select statement to request information, update statement, create statement, delete statement, etc. Moreover, the statement may include parameters that specify data, or data container (database, table, record, column, view, etc.), identifier(s), conditions (comparison operators), functions (e.g., join, full join, count, average, etc.), sort (e.g., ascending, descending), or others. The DBMS may execute the statement. For example, the DBMS may access a memory buffer, a reference or index a file for read, write, deletion, or any combination thereof, for responding to the statement. The DBMS may load the data from persistent or non-persistent storage and perform computations to respond to the query. The DBMS may return the result(s) to the user or software application.
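A minimal sketch of submitting statements to a DBMS, using Python's built-in `sqlite3` module with an in-memory database. The `calls` table and its rows are hypothetical; the statements show a create statement, inserts with bound parameters, a select with a condition and an ascending sort, and the count and average functions.

```python
import sqlite3

# In-memory database; the table and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (caller TEXT, callee TEXT, cost INTEGER)")  # create
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [("main", "collatz", 4), ("main", "error", 7), ("collatz", "half", 1)],
)
# Select statement with a condition, a sort, and a bound parameter.
rows = conn.execute(
    "SELECT callee, cost FROM calls WHERE caller = ? ORDER BY cost ASC", ("main",)
).fetchall()
# Aggregate functions (count, average) computed by the DBMS.
total = conn.execute("SELECT COUNT(*), AVG(cost) FROM calls").fetchone()
conn.close()

print(rows)   # [('collatz', 4), ('error', 7)]
print(total)  # (3, 4.0)
```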

The computing system of FIG. 8A may include functionality to present raw and/or processed data, such as results of comparisons and other processing. For example, presenting data may be accomplished through various presenting methods. Specifically, data may be presented through a user interface provided by a computing device. The user interface may include a GUI that displays information on a display device, such as a computer monitor or a touchscreen on a handheld computer device. The GUI may include various GUI widgets that organize what data is shown as well as how data is presented to a user. Furthermore, the GUI may present data directly to the user, e.g., data presented as actual data values through text, or rendered by the computing device into a visual representation of the data, such as through visualizing a data model.

For example, a GUI may first obtain a notification from a software application requesting that a particular data object be presented within the GUI. Next, the GUI may determine a data object type associated with the particular data object, e.g., by obtaining data from a data attribute within the data object that identifies the data object type. Then, the GUI may determine any rules designated for displaying that data object type, e.g., rules specified by a software framework for a data object class or according to any local parameters defined by the GUI for presenting that data object type. Finally, the GUI may obtain data values from the particular data object and render a visual representation of the data values within a display device according to the designated rules for that data object type.
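The rule lookup performed by the GUI can be sketched as a dispatch table keyed by data object type. In this hypothetical Python sketch, each object carries a `type` attribute, formatting to a string stands in for rendering a visual representation, and objects with no designated rule fall back to plain text.

```python
from typing import Any, Callable, Dict

# Hypothetical display rules keyed by data object type.
RULES: Dict[str, Callable[[Any], str]] = {
    "number": lambda v: f"{v:,.2f}",
    "text": lambda v: str(v).strip(),
    "flag": lambda v: "yes" if v else "no",
}


def render(data_object: Dict[str, Any]) -> str:
    """Determine the object's type from its data attribute, then apply the
    rule designated for that type to the object's value."""
    obj_type = data_object["type"]          # attribute identifying the type
    rule = RULES.get(obj_type, str)         # fall back to plain text
    return rule(data_object["value"])


print(render({"type": "number", "value": 1234.5}))  # 1,234.50
print(render({"type": "flag", "value": False}))     # no
```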

Data may also be presented through various audio methods. In particular, data may be rendered into an audio format and presented as sound through one or more speakers operably connected to a computing device.

Data may also be presented to a user through haptic methods. For example, haptic methods may include vibrations or other physical signals generated by the computing system. For example, data may be presented to a user using a vibration generated by a handheld computer device with a predefined duration and intensity of the vibration to communicate the data.

The above description of functions presents only a few examples of functions performed by the computing system of FIG. 8A. Other functions may be performed using one or more embodiments of the invention.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this disclosure, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as disclosed herein. Accordingly, the scope of the invention should be limited only by the attached claims.

What is claimed is:
 1. A method for optimizing program execution of a program, the method comprising: performing, to obtain an expanded call graph, an expansion of an initial call graph, the expanded call graph comprising a plurality of nodes, the initial call graph being defined for a program comprising a root method and a child method; calculating a cost value and a benefit value for inlining the child method; calculating an inlining priority value as a function of the cost value and the benefit value; and inlining, based on analyzing the expanded call graph and comparing the inlining priority value to a dynamic threshold, the child method into the root method, the child method corresponding to a node of the plurality of nodes in the expanded call graph.
 2. The method of claim 1, further comprising: calculating the dynamic threshold as a function of memory usage.
 3. The method of claim 1, wherein the cost value is calculated as a size of a program representation of the node.
 4. The method of claim 1, further comprising: calculating an expansion priority value for each of a plurality of child nodes of a parent node, the plurality of child nodes and the parent node in the plurality of nodes; selecting a child node from the plurality of child nodes based on the expansion priority value; and replacing a child node with an expanded child node to obtain the expanded call graph.
 5. The method of claim 4, wherein the child node is expanded based on the expansion priority value satisfying an expansion threshold, the expansion threshold is a function of a size of an intermediate representation of the initial call graph.
 6. The method of claim 4, further comprising: calculating the expansion priority value for each of the plurality of nodes; and traversing, to perform the expansion, the plurality of nodes in order defined by the expansion priority value of each of the plurality of nodes.
 7. The method of claim 1, further comprising: inlining, based on an analysis of the expanded call graph indicating simultaneously inlining, a plurality of methods into the root method simultaneously, the plurality of methods comprising the child method.
 8. A system for optimizing program execution of a program, the system comprising: memory; and a computer processor configured to execute a compiler stored in the memory, the compiler for causing the computer processor to: perform, to obtain an expanded call graph, an expansion of an initial call graph, the expanded call graph comprising a plurality of nodes, the initial call graph being defined for a program comprising a root method and a child method; calculate a cost value and a benefit value for inlining the child method; calculate an inlining priority value as a function of the cost value and the benefit value; and inline, based on analyzing the expanded call graph and comparing the inlining priority value to a dynamic threshold, the child method into the root method, the child method corresponding to a node of the plurality of nodes in the expanded call graph.
 9. The system of claim 8, wherein the computer processor further: calculates the dynamic threshold as a function of memory usage.
 10. The system of claim 8, wherein the computer processor calculates the cost value as a size of a program representation of the node.
 11. The system of claim 8, wherein the computer processor further: calculates an expansion priority value for each of a plurality of child nodes of a parent node, the plurality of child nodes and the parent node in the plurality of nodes; selects a child node from the plurality of child nodes based on the expansion priority value; and replaces a child node with an expanded child node to obtain the expanded call graph.
 12. The system of claim 11, wherein the computer processor expands the child node based on the expansion priority value satisfying an expansion threshold, the expansion threshold is a function of a size of an intermediate representation of the initial call graph.
 13. The system of claim 11, wherein the computer processor further: calculates the expansion priority value for each of the plurality of nodes; and traverses, to perform the expansion, the plurality of nodes in order defined by the expansion priority value of each of the plurality of nodes.
 14. The system of claim 8, wherein the computer processor further: inlines, based on an analysis of the expanded call graph indicating simultaneously inlining, a plurality of methods into the root method simultaneously, the plurality of methods comprising the child method.
 15. A non-transitory computer readable medium comprising instructions that, when executed by a computer processor, perform operations comprising: performing, to obtain an expanded call graph, an expansion of an initial call graph, the expanded call graph comprising a plurality of nodes, the initial call graph being defined for a program comprising a root method and a child method; calculating a cost value and a benefit value for inlining the child method; calculating an inlining priority value as a function of the cost value and the benefit value; and inlining, based on analyzing the expanded call graph and comparing the inlining priority value to a dynamic threshold, the child method into the root method, the child method corresponding to a node of the plurality of nodes in the expanded call graph.
 16. The non-transitory computer readable medium of claim 15, wherein the operations further comprise: calculating the dynamic threshold as a function of memory usage.
 17. The non-transitory computer readable medium of claim 15, wherein the cost value is calculated as a size of a program representation of the node.
 18. The non-transitory computer readable medium of claim 15, wherein the operations further comprise: calculating an expansion priority value for each of a plurality of child nodes of a parent node, the plurality of child nodes and the parent node in the plurality of nodes; selecting a child node from the plurality of child nodes based on the expansion priority value; and replacing a child node with an expanded child node to obtain the expanded call graph.
 19. The non-transitory computer readable medium of claim 18, wherein the child node is expanded based on the expansion priority value satisfying an expansion threshold, the expansion threshold is a function of a size of an intermediate representation of the initial call graph.
 20. The non-transitory computer readable medium of claim 18, wherein the operations further comprise: calculating the expansion priority value for each of the plurality of nodes; and traversing, to perform the expansion, the plurality of nodes in order defined by the expansion priority value of each of the plurality of nodes.